MCMC

Jesse Brunner

Motivation for MCMC

Many models are too complex for quadratic approximation or other shortcuts.

We need a general method of numeric integration to explore the posterior effectively.

As \(t \to \infty\), the samples from MCMC converge to the posterior.

What’s in a name? Markov chain Monte Carlo

Monte Carlo: “repeated random sampling to obtain numerical results” -wikipedia

Markov chain: stochastic model in which state at time \(t\) depends only on previous state at \(t-1\)
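A minimal Python sketch of the two ideas (a hypothetical example, not from the lecture): a Monte Carlo estimate built from repeated random sampling, and a Markov chain whose state at time \(t\) depends only on the state at \(t-1\):

```python
import random

random.seed(1)

# Monte Carlo: estimate E[X^2] for X ~ Uniform(0, 1) by repeated random
# sampling; the true value is 1/3.
samples = [random.random() ** 2 for _ in range(100_000)]
mc_estimate = sum(samples) / len(samples)

# Markov chain: a random walk. The state at time t is built from the state
# at time t-1 and nothing else.
state = 0.0
chain = [state]
for _ in range(10):
    state = state + random.gauss(0, 1)  # depends only on the current state
    chain.append(state)

print(round(mc_estimate, 3))  # close to 1/3
```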

Imagine a posterior

Imagine this one-dimensional posterior:

  • The x-axis is our parameter, \(\theta\).
  • The y-axis is the posterior probability.
    • actually, it is only proportional to the posterior
    • we don’t know the probability of the data, \(\text{Pr}(x)\).

Step 1

Imagine we take a guess for \(\theta\). This is the first step of our (Markov) chain.

Step 1

We can then find the height for this guess

height = likelihood of data given \(\theta\) \(\times\) prior of \(\theta\)
= \(\text{Pr}(x|\theta) \times \text{Pr}(\theta)\).
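In code, that height is just the likelihood times the prior. A sketch under an assumed example (binomial data with \(x = 6\) successes in \(n = 9\) trials and a flat Beta(1, 1) prior; all values are hypothetical):

```python
from math import comb

def height(theta, x=6, n=9):
    """Unnormalized posterior 'height': Pr(x | theta) * Pr(theta).

    Hypothetical example: binomial likelihood for x successes in n
    trials, with a flat Beta(1, 1) prior whose density is simply 1.
    """
    likelihood = comb(n, x) * theta**x * (1 - theta) ** (n - x)
    prior = 1.0  # flat prior on [0, 1]
    return likelihood * prior

print(height(0.5))   # height at a first guess of theta = 0.5
print(height(0.67))  # a guess near the observed proportion sits higher
```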

Step 2

Choose a proposal for a new value on the chain.

  • Our proposal distribution (blue) is centered on our last guess.
  • Can be any symmetric shape.

Step 2

Find height (posterior-ish value) for our proposal

Step 3

Calculate the ratio of those heights: \[ r = \frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x|\theta_{t-1})\text{Pr}(\theta_{t-1})} = \frac{0.078}{0.07} = 1.114 \]

Why ratios are clever

We want to calculate the ratio of posterior probabilities:

\[ r = \frac{\text{Pr}(\theta_p | x)}{\text{Pr}(\theta_{t-1} | x)} = \frac{\frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x)}}{\frac{\text{Pr}(x|\theta_{t-1})\text{Pr}(\theta_{t-1})}{\text{Pr}(x)}} \] but we do not know (and cannot calculate) \(\text{Pr}(x)\).

However, they cancel out! We just need the numerators! \[ r = \frac{\text{Pr}(\theta_p | x)}{\text{Pr}(\theta_{t-1} | x)} = \frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x|\theta_{t-1})\text{Pr}(\theta_{t-1})} \]
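A quick numeric check of the cancellation (the heights 0.078 and 0.070 come from the earlier slide; the normalizer 0.42 is made up, since the whole point is that we never need it):

```python
# The unknown normalizer Pr(x) divides both posterior heights,
# so it drops out of the ratio.
num_p   = 0.078  # Pr(x | theta_p) * Pr(theta_p)
num_old = 0.070  # Pr(x | theta_{t-1}) * Pr(theta_{t-1})
pr_x    = 0.42   # pretend we somehow knew Pr(x)

r_with    = (num_p / pr_x) / (num_old / pr_x)  # ratio of full posteriors
r_without = num_p / num_old                    # ratio of numerators only

print(abs(r_with - r_without) < 1e-12)  # True: the two ratios agree
```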

Step 3

Choose a random uniform variable, \(U \sim \text{Uniform}(0,1)\)

If \(r > U\), accept the proposal & add it to the chain. If \(r \le U\), reject the proposal: the chain stays at \(\theta_{t-1}\) (record it again) and we try a new proposal.

Repeat steps 2 & 3

\[ r = \frac{0.061}{0.078} = 0.782 \] \(U = 0.89\)

Since \(r < U\), reject proposal & try a new proposal.

One more time

\[ r = \frac{0.079}{0.078} = 1.013 \] \(U=0.212\)

Since \(r > U\), accept the proposal & add it to the chain

Chain so far

Chain after a while

Turn it on its side

It works!

This is the Metropolis Algorithm
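Steps 1–3, repeated, are the whole algorithm. A minimal sketch in Python, using the same hypothetical binomial example (6 successes in 9 trials, flat prior); the function names and tuning values are made up for illustration:

```python
import random
from math import comb

random.seed(42)

def height(theta, x=6, n=9):
    """Unnormalized posterior: binomial likelihood x flat prior (illustrative)."""
    if not 0 < theta < 1:
        return 0.0
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

def metropolis(n_steps=20_000, step_sd=0.1):
    theta = 0.5                                   # Step 1: initial guess
    chain = [theta]
    for _ in range(n_steps):
        proposal = random.gauss(theta, step_sd)   # Step 2: symmetric proposal
        r = height(proposal) / height(theta)      # Step 3: ratio of heights
        if r > random.random():                   # accept w.p. min(1, r)
            theta = proposal                      # accept: move to proposal
        chain.append(theta)                       # reject: record current again
    return chain

chain = metropolis()
posterior_mean = sum(chain) / len(chain)
print(posterior_mean)  # analytic posterior mean is (x+1)/(n+2) = 7/11 ~ 0.64
```

On rejection the chain records the current value again; that repetition is what makes the time spent in a region proportional to its posterior probability.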

Did you catch the problem?

What if we get proposals outside of possible boundaries?

  • lots of rejections
  • inefficient sampling

Solution: asymmetric proposal distribution

However, this proposal distribution is \(a\)symmetric:

\[ \text{Pr}(\theta_{t-1} \rightarrow \theta_p) \neq \text{Pr}(\theta_{p} \rightarrow \theta_{t-1}) \]

Metropolis-Hastings algorithm

Adjust the ratio to: \[ r = \frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x|\theta_{t-1})\text{Pr}(\theta_{t-1})}\times \frac{\text{Pr}(\theta_{t-1}|\theta_p)}{\text{Pr}(\theta_p|\theta_{t-1})} \]

and it works!
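A sketch of the correction, assuming a log-normal random-walk proposal for a positive-only parameter (a common asymmetric choice; everything here is illustrative, not from the slides):

```python
from math import exp, log, sqrt, pi

def lognormal_pdf(y, mu_log, sd_log):
    """Density of a log-normal proposal centered (on the log scale) at mu_log."""
    return exp(-(log(y) - mu_log) ** 2 / (2 * sd_log**2)) / (y * sd_log * sqrt(2 * pi))

def mh_ratio(height_new, height_old, theta_new, theta_old, sd_log=0.2):
    """Metropolis-Hastings ratio with the asymmetry correction.

    The proposal theta_new = theta_old * exp(Normal(0, sd_log)) is
    log-normal, which keeps a positive-only parameter in bounds but
    is not symmetric.
    """
    q_forward = lognormal_pdf(theta_new, log(theta_old), sd_log)  # theta_{t-1} -> theta_p
    q_back    = lognormal_pdf(theta_old, log(theta_new), sd_log)  # theta_p -> theta_{t-1}
    return (height_new / height_old) * (q_back / q_forward)

print(mh_ratio(0.078, 0.070, theta_new=1.2, theta_old=1.0))
```

For this particular proposal the correction \(q(\theta_{t-1}|\theta_p)/q(\theta_p|\theta_{t-1})\) simplifies to \(\theta_p/\theta_{t-1}\).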

But… higher dimensional problems

In higher dimensions, a proposed jump from one of these points can easily land far from the region of high posterior density.

  • Gets worse as the number of dimensions grows
  • Many to most proposals are rejected \(\longrightarrow\) inefficient

Solution: HMC

Hamiltonian Monte Carlo

  • pretends the (negative-log) posterior is a surface
  • simulates a ball rolling along that (without friction)
  • produces proposals informed by topography
  • otherwise like Metropolis-Hastings

HMC

First, take the negative log, so high probability = low areas

HMC

Take the first guess

HMC

…and give it a bit of momentum (in a direction of parameter space)

HMC

Track its movement for a certain amount of “time” using Hamiltonian equations

\[ \begin{align} \text{Total Energy} &= \text{Potential} + \text{Kinetic}\\ E &= U(\theta) + K(\text{momentum}) \end{align} \]

After a bit of time, point at end is the new proposal

  • Proposals tend to be in areas with higher probability density

HMC details

No analytic solution to Hamiltonian for most problems, so:

  • solve it numerically over a series of steps (Leapfrog algorithm)
  • Stan will tune step size and number of leapfrog steps
    • too many steps and we end up at the bottom, every time
    • too coarse and we don’t follow the posterior
  • Total energy (\(= \text{Potential} + \text{Kinetic}\)) at start should equal the energy at the end if we followed posterior surface (No friction!)
    • gives a warning if these diverge

Proposal acceptance actually reflects the difference in energy (i.e., how well we followed the path of the ball)

  • unless there were divergences, we should accept
  • divergences suggest a problem we need to fix
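A minimal leapfrog sketch for a one-dimensional standard-normal target, where \(U(\theta) = \theta^2/2\) (a hypothetical toy case; the step size and step count are arbitrary, not Stan's tuned values):

```python
import random
from math import exp

random.seed(7)

def grad_U(theta):
    """Gradient of the potential U(theta) = theta**2 / 2 (standard normal)."""
    return theta

def leapfrog(theta, momentum, step_size=0.1, n_steps=20):
    """Follow the Hamiltonian numerically: half/full/half leapfrog updates."""
    momentum -= 0.5 * step_size * grad_U(theta)   # initial half step (momentum)
    for _ in range(n_steps - 1):
        theta += step_size * momentum             # full step (position)
        momentum -= step_size * grad_U(theta)     # full step (momentum)
    theta += step_size * momentum
    momentum -= 0.5 * step_size * grad_U(theta)   # final half step (momentum)
    return theta, momentum

theta = 1.0
momentum = random.gauss(0, 1)                     # random kick to start
energy_start = theta**2 / 2 + momentum**2 / 2     # Potential + Kinetic
theta_new, momentum_new = leapfrog(theta, momentum)
energy_end = theta_new**2 / 2 + momentum_new**2 / 2

# If we followed the surface well, total energy is (nearly) conserved and the
# proposal is accepted with probability min(1, exp(energy_start - energy_end)).
accept_prob = min(1.0, exp(energy_start - energy_end))
print(accept_prob)
```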

HMC details

Overall, each proposal requires many more calculations (many steps, calculating kinetic energy & momentum at each). But proposals are much better and rarely rejected, so HMC is much more efficient overall.

  • Most of the time spent with Stan goes to compiling the model into C++ code
  • Running is fast!

Assumes a continuous likelihood surface, so no discrete parameters!

  • WinBugs/JAGS allow for \(\theta \in \{0,1\}\) (can be useful for occupancy and similar models)
  • In HMC we “integrate out” these discrete parameters… faster & simpler

Other problems involve hard-to-simulate likelihood surfaces (more later)